Importing all the required libraries

Importing the dataset

Handling missing values
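The cleaning rule itself is not shown here, so the following is only a minimal sketch of one common approach: drop columns that are mostly empty, then fill the remaining gaps with the column median. The toy frame, its column names, and the 70% threshold are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (column names are hypothetical).
df = pd.DataFrame({
    "arpu_6": [100.0, np.nan, 250.0, 80.0],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
    "total_rech_amt_6": [200.0, 150.0, np.nan, 90.0],
})

# Share of missing values per column.
missing_share = df.isna().mean()

# Drop columns above an assumed 70% missing threshold.
threshold = 0.7
kept = df.loc[:, missing_share <= threshold]

# Fill the remaining gaps with each column's median.
cleaned = kept.fillna(kept.median())
print(cleaned)
```

Median imputation is used in this sketch because it is less sensitive to extreme values than the mean. Whether it suits a given column depends on what a missing value means in that column.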

We filtered the data down to only the high-value customers, which leaves 29994 rows, as stated in the problem statement.
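The filtering rule is not spelled out in this note. As a sketch only, we assume here that a high-value customer is one whose average recharge amount over two earlier months is at or above the 70th percentile. The column names and the cutoff are assumptions for illustration, not taken from the notebook.

```python
import pandas as pd

# Toy recharge data (hypothetical column names and values).
df = pd.DataFrame({
    "total_rech_amt_6": [100, 500, 300, 50, 900, 20, 700, 60, 400, 10],
    "total_rech_amt_7": [120, 450, 280, 70, 950, 30, 650, 40, 420, 20],
})

# Average recharge over the two months.
df["avg_rech_amt_6_7"] = df[["total_rech_amt_6", "total_rech_amt_7"]].mean(axis=1)

# Assumed cutoff: 70th percentile of the average recharge amount.
cutoff = df["avg_rech_amt_6_7"].quantile(0.70)

# Keep only customers at or above the cutoff.
high_value = df[df["avg_rech_amt_6_7"] >= cutoff].reset_index(drop=True)
print(len(high_value), "high-value rows out of", len(df))
```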

Creating the dependent variable
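One plausible way to derive a churn label is to flag customers with no usage at all in the last month. The sketch below assumes that definition and uses hypothetical usage columns ending in `_9`. The actual rule used in the notebook may differ.

```python
import pandas as pd

# Toy usage data for the last month (hypothetical column names).
df = pd.DataFrame({
    "total_ic_mou_9": [0.0, 12.5, 0.0],
    "total_og_mou_9": [0.0, 3.0, 0.0],
    "vol_2g_mb_9": [0.0, 0.0, 5.0],
    "vol_3g_mb_9": [0.0, 0.0, 0.0],
})

usage_cols = ["total_ic_mou_9", "total_og_mou_9", "vol_2g_mb_9", "vol_3g_mb_9"]

# Assumed rule: churn = 1 when calls and data usage are all zero, else 0.
df["churn"] = (df[usage_cols].sum(axis=1) == 0).astype(int)
print(df["churn"].tolist())
```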

Removing all the variables with '_9'
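A minimal sketch of this step, assuming '_9' appears as a suffix on the column names (the toy columns below are hypothetical). If the target was derived from these columns, dropping them only after the label exists keeps the label and stops the model from seeing the information that defines it.

```python
import pandas as pd

# Toy frame with a mix of month-8 and month-9 columns (hypothetical names).
df = pd.DataFrame({
    "arpu_8": [1.0],
    "arpu_9": [2.0],
    "total_ic_mou_9": [0.0],
    "churn": [1],
})

# Collect every column whose name ends with '_9' and drop them together.
cols_9 = [c for c in df.columns if c.endswith("_9")]
df = df.drop(columns=cols_9)
print(list(df.columns))
```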

Data visualisation

We can see there is a lot of imbalance in the churn data. We will deal with it at a later stage.
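Besides the plot, the imbalance can be quantified with the class shares. The sketch below uses a made-up label vector purely to show the call. The real proportions come from the actual `churn` column.

```python
import pandas as pd

# Made-up labels for illustration only: 9 non-churners, 1 churner.
churn = pd.Series([0] * 9 + [1], name="churn")

# Relative frequency of each class.
share = churn.value_counts(normalize=True)
print(share)
```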

Clearly there are a lot of outliers in the data

We can remove these outliers, as they make up less than 3 percent of the data. Removing them won't affect the data as a whole.

Now it looks better than before

Handling Imbalanced data

Now everything is ready for performing imbalance handling
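Before resampling, it usually helps to split off a test set so that only the training portion is rebalanced and the test set keeps the real class ratio. The sketch below shows a stratified split on synthetic data. The sizes and the 8% churn rate are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced label (8% positives, assumed).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 80 + [0] * 920)

# stratify=y keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())
```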

Oversampling
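As a minimal sketch of the idea, random oversampling draws minority rows with replacement until both classes have the same size. This uses `sklearn.utils.resample` on synthetic data and assumes the minority class is labelled 1. It is the simplest variant, not necessarily the one used in this notebook.

```python
import numpy as np
from sklearn.utils import resample

# Synthetic training data: 10 minority rows, 90 majority rows.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = np.array([1] * 10 + [0] * 90)

X_min, X_maj = X_train[y_train == 1], X_train[y_train == 0]

# Sample minority rows with replacement up to the majority size.
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

# Stack both classes back together with matching labels.
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([
    np.zeros(len(X_maj), dtype=int),
    np.ones(len(X_min_up), dtype=int),
])
print(np.bincount(y_bal))
```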

Performing PCA
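A minimal sketch of PCA with scikit-learn, assuming the features are standardised first and that we keep enough components to explain 95% of the variance. Both choices are assumptions, and the synthetic data only stands in for the real feature matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Ten correlated features built from three latent factors plus small noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = base @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

# Scale first, then keep the components needed for 95% explained variance.
pca = make_pipeline(StandardScaler(), PCA(n_components=0.95, svd_solver="full"))
X_reduced = pca.fit_transform(X)

explained = pca.named_steps["pca"].explained_variance_ratio_.sum()
print(X_reduced.shape, round(explained, 3))
```

Scaling matters here because PCA is driven by variance, so features measured on larger scales would otherwise dominate the components.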

Performing all the Imbalance handling methods and Comparing the models
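The comparison can be organised as a loop over resampling strategies, fitting the same model on each resampled training set and scoring on an untouched test set. The sketch below is an assumption about how such a harness might look. It uses synthetic data and only two strategies (none and a hand-written random oversampler). Samplers from imbalanced-learn such as `SMOTE`, `ADASYN`, `SMOTETomek`, and `SMOTEENN` could be plugged in through their `fit_resample` method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample


def no_resampling(X, y):
    """Baseline: return the training data unchanged."""
    return X, y


def random_oversample(X, y, seed=0):
    """Duplicate minority rows (label 1) until both classes match in size."""
    X_min, X_maj = X[y == 1], X[y == 0]
    X_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=seed)
    X_out = np.vstack([X_maj, X_up])
    y_out = np.concatenate([
        np.zeros(len(X_maj), dtype=int),
        np.ones(len(X_up), dtype=int),
    ])
    return X_out, y_out


# Synthetic imbalanced data standing in for the churn dataset.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)

samplers = {"original": no_resampling, "random_oversampling": random_oversample}
results = {}
for name, sampler in samplers.items():
    # Resample the training set only; the test set keeps the real class ratio.
    X_res, y_res = sampler(X_tr, y_tr)
    model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    results[name] = {
        "recall": recall_score(y_te, model.predict(X_te)),
        "roc_auc": roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]),
    }

for name, scores in results.items():
    print(name, scores)
```

Swapping `LogisticRegression` for a decision tree or random forest reuses the same loop, which keeps the comparison across models and resampling methods consistent.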

Model-1: Logistic Regression

1. Original Unsampled Data

2. SMOTE Resampling

3. ADASYN Resampling

4. SMOTE + Tomek Resampling

5. SMOTE + ENN Resampling
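The SMOTE-based variants above share one core idea: new minority points are created by interpolating between a minority sample and one of its nearest minority neighbours. The function below is a simplified NumPy sketch of that interpolation step only. It is not the imbalanced-learn implementation, and the name `smote_like` and its parameters are hypothetical.

```python
import numpy as np


def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Simplified illustration: assumes k < len(X_min) and numeric features only.
    """
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base point and one of its k nearest neighbours.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Place the new point somewhere on the segment between the two.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


X_min = np.random.default_rng(1).normal(size=(20, 2))
synthetic = smote_like(X_min, n_new=50)
print(synthetic.shape)
```

Because every synthetic point lies between two existing minority points, the new samples stay inside the region already covered by the minority class instead of being exact duplicates.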

Model-2: Decision Tree

1. Original Unsampled Data

2. SMOTE Resampling

3. ADASYN Resampling

4. SMOTE + Tomek Resampling

5. SMOTE + ENN Resampling

Model-3: Random Forest

1. Original Unsampled Data

2. SMOTE Resampling

3. ADASYN Resampling

4. SMOTE + Tomek Resampling

5. SMOTE + ENN Resampling

Model Comparison

The best of all these models is the random forest with no resampling, i.e. trained on the original data.

Now let's extract the best features from the random forest we just built
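A minimal sketch of how the ranking can be read from a fitted scikit-learn random forest through its `feature_importances_` attribute. The data here is synthetic and the feature names are placeholders, so the output says nothing about the actual churn features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances, one value per feature, summing to 1.
importances = pd.Series(rf.feature_importances_, index=feature_names)
top10 = importances.sort_values(ascending=False).head(10)
print(top10)
```

These importances are impurity-based, which can favour features with many distinct values, so they are best read as a rough ranking rather than an exact measure of effect.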

These are the top 10 features that play a major role.